The Search for Value based on Wine Spectator reviews

Justin Meisenhelter

Table Of Contents

Import Modules and Data

Initial Data Cleanup

NA Values in the following fields:

Price

        np.sum(reviews['price'].isnull())
        total = 8996

Solution: Drop. Price is going to be a key observation for this dataset. I considered interpolating,
but the standard deviation of the price column is very high. With more time, a better interpolation could be worked out.

Province

        np.sum(reviews['province'].isnull())
       total = 63

Solution: ignore (secondary information)

Taster Name

        np.sum(reviews['taster_name'].isnull())
        total = 26244

Solution: Drop (untrustworthy)

Variety

     np.sum(reviews['variety'].isnull())
     total = 1

Investigate: reviews[reviews.variety.isnull()] shows that this row would be dropped anyway, since its taster_name is NaN.
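The cleanup decisions above can be sketched in a few lines. This is a minimal illustration on a tiny made-up frame, assuming only the column names used in this section (`price`, `province`, `taster_name`, `variety`); the variable names `na_counts` and `cleaned` are mine, not from the notebook.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the real `reviews` dataframe (illustration only).
reviews = pd.DataFrame({
    "price": [20.0, np.nan, 35.0, 15.0],
    "province": ["California", "Bordeaux", None, "Tuscany"],
    "taster_name": ["Roger Voss", "Roger Voss", "Paul Gregutt", None],
    "variety": ["Pinot Noir", "Merlot", "Syrah", None],
})

# Count missing values per column in one pass.
na_counts = reviews.isnull().sum()

# Drop rows missing price or taster_name; province NAs are left in place.
cleaned = reviews.dropna(subset=["price", "taster_name"])
```

In this toy frame the row with the missing variety also has a missing taster_name, so it disappears without a separate drop, which mirrors the situation described above.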

Broad Stroke Analysis

  1. reviews.points.describe()
  2. reviews.price.describe()

There is a big jump after the 75% quantile, so let's look at what's above that. 25 to 42 could be a good bucket.

Conclusion: Wine Spectator primarily reviews moderately priced wines, but there seems to be a good distribution above that, with nearly 23,000 reviews above the 75% quantile. Wine priced between 42 and 79 could be a good bucket.
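As a rough sketch of how the quantile boundaries and the count above the 75% quantile can be pulled out directly, here is a small example. The prices are made up for illustration, and the variable names are assumptions rather than the notebook's own.

```python
import pandas as pd

# Made-up prices for illustration; the real values come from reviews['price'].
prices = pd.Series([10, 15, 20, 25, 30, 42, 60, 79, 150, 500], name="price")

# Quartile boundaries, the same figures describe() reports as 25%, 50%, 75%.
q25, q50, q75 = prices.quantile([0.25, 0.50, 0.75])

# Number of reviews priced above the 75% quantile.
n_above_q75 = int((prices > q75).sum())
```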

Simple Correlation Analysis

Let's do some basic relationship analysis, starting with price vs. points.

There are lots of price outliers, so let's look at the bulk of our data, with price less than 1000.

Not a bad logarithmic fit at first glance.
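One simple way to fit a logarithmic trend is to regress points on the log of price. The sketch below assumes that form, points = a * ln(price) + b, which may not be exactly what the plotting library fitted in the notebook; the data points are made up.

```python
import numpy as np

# Made-up (price, points) pairs for illustration only.
price = np.array([10.0, 20.0, 40.0, 80.0, 160.0, 320.0])
points = np.array([84.0, 86.0, 88.0, 90.0, 92.0, 94.0])

# Fit points = a * ln(price) + b by regressing points on log-price.
a, b = np.polyfit(np.log(price), points, deg=1)

def predict_points(p):
    """Predicted score at price p under the fitted curve."""
    return a * np.log(p) + b
```

Under this model, doubling the price adds a constant a * ln(2) points, which is why the curve flattens out at high prices.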

Let's look at the 5 most popular countries and see if any other trends emerge.
value_counts() gives us the most popular countries as: US, France, Italy, Spain, Portugal, Chile.
Let's filter our DF.
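The filter step might look something like this. It is a sketch on made-up rows, assuming a `country` column; `top_countries` and `top_reviews` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration; the real frame is `reviews`.
reviews = pd.DataFrame({
    "country": ["US", "US", "US", "France", "France", "Italy", "Spain"],
    "points": [90, 88, 91, 89, 92, 87, 86],
})

# value_counts() sorts by frequency, so head(n) gives the n most reviewed.
n = 2  # the notebook uses a larger n; 2 keeps the toy example readable
top_countries = reviews["country"].value_counts().head(n).index.tolist()

# Keep only the rows from those countries.
top_reviews = reviews[reviews["country"].isin(top_countries)]
```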

Interesting... the trendline for Italy is significantly different from the other 4 countries. Let's group by these 5 countries and look at the stats.

Nothing is immediately obvious; Italian wines have a slight bias towards being more expensive than the other countries.

Hypothesis: Italian wines have lower value (cost/score ratio) than the other 4 sampled countries. We will dive into this later.

Analysis: Initial Wine Value Exploration

Let's break up wine price into buckets. No general consensus seems to exist in the wine industry as to what price buckets would be appropriate, so let's use the data itself and bucket by quantile:
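A minimal sketch of quantile bucketing with `pd.qcut` follows. The bucket names are borrowed from the ones mentioned later in this notebook, but the number of buckets and the cut points here are assumptions, and the prices are made up.

```python
import pandas as pd

# Made-up prices for illustration only.
reviews = pd.DataFrame({"price": [8, 12, 18, 25, 33, 42, 60, 79, 120, 400]})

# Hypothetical bucket names; the notebook's exact labels and cut points
# are not shown in the text, so these are assumptions.
labels = ["cheap", "inexpensive", "pricey", "expensive", "outlandish"]

# qcut splits on quantiles, so each bucket holds roughly the same number of rows.
reviews["price_bucket"] = pd.qcut(reviews["price"], q=len(labels), labels=labels)
```

Equal-count buckets keep the sample size per bucket comparable, at the cost of uneven price ranges (the top bucket spans far more dollars than the bottom one).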

Let's look at a boxplot of the price buckets to see how our data is distributed.

Interesting... there is a clear linear trend of rating versus price.
The interquartile spread seems to be consistent except for the 'expensive' bucket, where it is noticeably larger.
Let's see if individual countries are skewing any of this data.

US wines seem to be the least consistent, with the largest average spread over price.
Italian wines get more inconsistent as price increases.
Spanish wines are fairly consistent over all price values.
French wines are fairly consistent over all price values but get more consistent in the 'outlandish' category.

Reviewer Biases

Before we continue our quest for value in wine, we need to look for any obvious bias in the review scores.

Let's look at some of the most prolific reviewers and their most reviewed regions, and see if they match the general statistics of the region as a whole.

Find the top 10 reviewers.

The top 10 reviewers are: Roger Voss, Michael Schachner, Kerin O’Keefe, Virginie Boone, Paul Gregutt, Matt Kettmann, Joe Czerwinski, Sean P. Sullivan, Anna Lee C. Iijima, Jim Gordon

Some reviewers do not have enough reviews in a country to look for bias, so we will exclude them. Include list:
Chile: Michael Schachner has done 95% of reviews for this country
France: [Joe Czerwinski, Roger Voss]; Roger Voss has done 95% of reviews for this country
Italy: [Joe Czerwinski, Michael Schachner, Roger Voss]
Spain: Michael Schachner has done 99% of reviews for this country
US: [Anna Lee C. Iijima, Jim Gordon, Joe Czerwinski, Matt Kettmann, Michael Schachner, Paul Gregutt, Sean P. Sullivan]
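The per-country shares quoted in the list above can be computed with a normalized crosstab. The sketch below uses made-up rows with placeholder taster names, and the 0.7 threshold is an arbitrary choice for the toy data, not a value from the notebook.

```python
import pandas as pd

# Made-up rows for illustration only; taster names are placeholders.
reviews = pd.DataFrame({
    "country": ["Spain", "Spain", "Spain", "Spain", "US", "US", "US", "US"],
    "taster_name": ["A", "A", "A", "B", "C", "C", "D", "D"],
})

# Most prolific tasters overall (the notebook takes the top 10).
top_tasters = reviews["taster_name"].value_counts().head(3).index.tolist()

# Share of each country's reviews written by each taster (rows sum to 1).
share = pd.crosstab(reviews["country"], reviews["taster_name"], normalize="index")

# Countries where one taster wrote more than a chosen share of the reviews.
dominated = share.max(axis=1) > 0.7
```

A country flagged in `dominated` has too little independent coverage to compare reviewers against each other there.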

Italian Wine Review, Search for Bias

Joe Czerwinski seems to be biased against Italian wines, Roger Voss's reviews seem to line up with the general Italian wine reviews, and Michael Schachner seems to have a slight bias favoring Italian wines.
Let's dig a little deeper to see if there is some selection bias happening.

Let's look at Joe Czerwinski's and Michael Schachner's reviews for Italy and see if they reviewed wine across multiple price buckets.

Conclusion

Both tasters' biases can be explained by their lopsided selection: cheap wines in Joe Czerwinski's case, and more expensive wines in Michael Schachner's case.

United States Wine Review, Search for Bias

There could be some bias, so let's break it down by price bucket.

Conclusion

On the surface, it looks like all biases can be explained by wine selection.
For most of the popularly reviewed countries, a significant proportion of the reviews are performed by one taster, making it impossible to search for significant bias in those reviews.
Let's dig a little deeper and do some math to see if there is any subtle bias.

The Search For Bias Continues: Numerical Analysis

Let's start with Anna Lee C. Iijima. Among reviewers with a significant number of US wine reviews (>100), she appears to be the furthest from the mean.
Let's sample our table with the same number of wines from each price bucket that Anna reviewed, then perform a two-tailed t-test to see if her mean differs significantly from the overall average.
The null hypothesis is that a reviewer's mean rating matches the overall mean score for the country; the alternative is that it does not.
We will reject the null hypothesis at a p-value below 0.05.
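Here is one way the bucket-matched sampling and t-test described above could be sketched. Everything in it is an assumption for illustration: the data is randomly generated, the taster names are placeholders, the helper `bucket_matched_ttest` is hypothetical, and I chose to draw the comparison sample from the other reviewers only, which may differ from the sampling used in the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Made-up US reviews for illustration: two price buckets, two tasters.
n = 400
us = pd.DataFrame({
    "taster_name": rng.choice(["reviewer_a", "reviewer_b"], size=n, p=[0.25, 0.75]),
    "price_bucket": rng.choice(["cheap", "pricey"], size=n),
})
us["points"] = np.where(us["price_bucket"] == "cheap", 86, 90) + rng.normal(0, 2, size=n)

def bucket_matched_ttest(df, taster, seed=0):
    """Compare one taster's scores with a sample of everyone else's scores
    that has the same number of wines per price bucket."""
    mine = df[df["taster_name"] == taster]
    others = df[df["taster_name"] != taster]
    counts = mine["price_bucket"].value_counts()
    samples = []
    for bucket, k in counts.items():
        pool = others[others["price_bucket"] == bucket]
        # Draw as many comparison wines as the taster reviewed in this
        # bucket, capped at what is available.
        samples.append(pool.sample(n=int(min(k, len(pool))), random_state=seed))
    matched = pd.concat(samples)
    # ttest_ind is two-sided by default.
    return stats.ttest_ind(mine["points"], matched["points"])

t_stat, p_value = bucket_matched_ttest(us, "reviewer_a")
reject_null = p_value < 0.05
```

Matching the bucket counts removes the selection effect seen earlier, so any remaining difference in means is not simply a result of which price range the taster tends to review.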

Let's do that t-test.

That's a low value... I think we can reasonably assume Anna does not have a significant bias against US wines, as long as our sampling method is reasonable.

That was easy, so let's weaponize that logic. Let's use this method to look for potential bias among the top 10 reviewers in the top 10 countries, restricted to reviewers with over 100 ratings in that country.

Conclusion: I have to say I am impressed. Wine Spectator reviewers are remarkably consistent as measured by this method. I feel confident there is no obvious bias from one reviewer over another by region.

The Search for Value

I will take the 7 most common pairing suggestions and, for each of the top 5 price buckets, try to find the regions that have the highest mean rating per dollar spent.

Start by looking at the varieties of wine with more than 200 entries (we need a reasonably sized list and a large sample size).

I will use my own knowledge of the subject of wine pairing to create a dictionary mapping each pairing classification to appropriate wines to serve.

Take it for a spin. Let's look at the top 5 value pairings for Fish and Inexpensive wines.
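A sketch of what such a lookup could look like is below. The `pairings` dictionary here is a hypothetical two-entry stand-in, not the author's actual mapping, and the reviews are made up. I also assume "value" means the mean of points divided by price per review, grouped by country; the notebook may define it differently.

```python
import pandas as pd

# Hypothetical pairing dictionary; the author's actual mapping is not shown.
pairings = {
    "Fish": ["Sauvignon Blanc", "Pinot Grigio"],
    "Red Meat": ["Cabernet Sauvignon", "Malbec"],
}

# Made-up reviews for illustration only.
reviews = pd.DataFrame({
    "country": ["US", "US", "France", "France", "Italy", "Argentina"],
    "variety": ["Sauvignon Blanc", "Sauvignon Blanc", "Sauvignon Blanc",
                "Pinot Grigio", "Pinot Grigio", "Malbec"],
    "price_bucket": ["inexpensive"] * 6,
    "price": [10.0, 20.0, 15.0, 15.0, 18.0, 12.0],
    "points": [88, 90, 87, 84, 90, 89],
})

def top_value(df, pairing, bucket, n=5):
    """Countries ranked by mean points per dollar for one pairing and bucket."""
    subset = df[df["variety"].isin(pairings[pairing]) & (df["price_bucket"] == bucket)]
    value = subset["points"] / subset["price"]
    return value.groupby(subset["country"]).mean().sort_values(ascending=False).head(n)

fish_inexpensive = top_value(reviews, "Fish", "inexpensive")
```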

Let's make a value dataframe with all combinations of pairing and price bucket.
Note: I had to remove the Outlandish price bucket, since there are not enough reviews for certain pairings to create meaningful results.

Let's try to visualize this data

Let's try to visualize how often these countries show up in our value dataframe.

Conclusion: This table gives us a good idea of which countries to consider purchasing a bottle from if we are searching for high-value wine.

Finding Exceptional Wines in the United States

We just discovered the United States is the country with the highest number of entries in our value dataframe. Let's try to find which specific regions and varietals overperform.

Filter our dataframe for wines from the United States

Let's further filter the list to varietals with over 200 reviews for better visualization and statistical treatment

Let's see how US varietals perform against the rest of the world in each price bucket.

There is a lot of data here, so I decided to break it down into 5 graphs by price bucket.

It seems like it would be very difficult to tell one plot apart from another, so let's try a numerical approach to find US wine varietals that score better on average than the rest of the world by price bucket.

We need to split the groupby into two dataframes, US and world, then join on price bucket and variety.

Let's only take rows with a value greater than zero (varieties where the US performs better).
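The split, join, and filter steps above might be sketched like this. The rows are made up, the column names `us_points`, `world_points`, and `diff` are hypothetical, and I assume "world" means every country except the US.

```python
import pandas as pd

# Made-up reviews for illustration only.
reviews = pd.DataFrame({
    "country": ["US", "US", "France", "Italy", "US", "France", "Chile"],
    "variety": ["Pinot Noir", "Pinot Noir", "Pinot Noir", "Pinot Noir",
                "Merlot", "Merlot", "Merlot"],
    "price_bucket": ["cheap"] * 7,
    "points": [90, 92, 88, 86, 84, 88, 87],
})

keys = ["price_bucket", "variety"]
is_us = reviews["country"] == "US"

# Mean score per bucket and variety, separately for the US and everyone else.
us_mean = reviews[is_us].groupby(keys)["points"].mean().rename("us_points")
world_mean = reviews[~is_us].groupby(keys)["points"].mean().rename("world_points")

# Inner join on the shared (price_bucket, variety) index, then take the difference.
compare = pd.concat([us_mean, world_mean], axis=1, join="inner")
compare["diff"] = compare["us_points"] - compare["world_points"]

# Keep only the rows where the US scores higher.
us_better = compare[compare["diff"] > 0]
```

The inner join drops any bucket and variety combination that appears on only one side, so every remaining row is a like-for-like comparison.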

This is a little difficult to read, so let's combine it with some color!

The US seems to perform most consistently on the 'cheap' wines. US 'Expensive' wines, when they do perform better than the world average, are significantly better.

Let's find the regions where the US is making this great wine

Conclusion: If you are looking for wine that scores higher than the world average, buy California!

Hidden Gems

We've spent a lot of time looking at countries known for their wine production, so let's look at some lesser-known areas.

Filtering our dataframe to countries with more than 100 reviews and fewer than 1000 gives us a nice little sample of 7 countries in which to look for quality wine.
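A sketch of that count-based filter is below, on made-up rows with scaled-down thresholds so the example stays small. The names `mid_countries` and `hidden` are hypothetical.

```python
import pandas as pd

# Made-up review counts for illustration; scaled-down thresholds keep it small.
reviews = pd.DataFrame({
    "country": ["US"] * 12 + ["Greece"] * 4 + ["Hungary"] * 3 + ["Peru"] * 1,
})
low, high = 2, 10  # the notebook uses 100 and 1000

counts = reviews["country"].value_counts()
# Countries with more than `low` and fewer than `high` reviews.
mid_countries = counts[(counts > low) & (counts < high)].index

hidden = reviews[reviews["country"].isin(mid_countries)]
```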

Let's see what kind of wine is being made in these countries

As expected, there are a lot of varietals with only a couple of reviews, so let's filter again for varietals with more than 10 reviews.
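One way to apply that second filter is `groupby(...).filter(...)`. In this sketch I assume the count is taken per country and variety pair, which the text does not state explicitly; the rows are made up and the threshold is scaled down.

```python
import pandas as pd

# Made-up rows for illustration; a scaled-down threshold keeps it small.
hidden = pd.DataFrame({
    "country": ["Greece"] * 5 + ["Hungary"] * 3,
    "variety": ["Assyrtiko"] * 4 + ["Agiorgitiko"] + ["Furmint"] * 3,
    "points": [90, 89, 91, 88, 87, 90, 92, 91],
})
min_reviews = 2  # the notebook uses 10

# Keep only (country, variety) groups with more than `min_reviews` rows.
frequent = hidden.groupby(["country", "variety"]).filter(lambda g: len(g) > min_reviews)
```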

We can work with this

Let's find varieties of wine that are made both in these countries and in the rest of the world in statistically significant quantities, and compare ratings.

This is the same thing as the last section.....dump it

Are you OK, Italy?

In our early analysis we discovered that, out of the 5 most reviewed countries, Italy appeared to score the lowest. Let's look further into that.

Alright, it's not just one price bucket... are there certain regions or varietals pulling the review scores down?

Conclusion: These are the regions and associated prices to avoid when purchasing Italian wine.

Let's look at varietals now.

Conclusion: Don't buy Italian Montepulciano, and in general avoid grapes of French origin for 'Pricey' wines.